Abstract



The robustness of validity of four methods for setting confidence
intervals for a location parameter when the scale is unknown is
investigated. Three methods involve estimating the variance of an
M-estimate of location, while the fourth is a procedure suggested by
Maritz, based on a permutation argument. The first three methods either
use a finite-sample approximation to the asymptotic variance (a
well-known standard) or make inferences on the basis of the shape of the
putative likelihood function. The latter approach is related to the work
of Sprott, as well as that of Efron and Hinkley on conditional inference.
Overall, the Maritz procedure performs best, though the standard does
surprisingly well.


Key  words:  Robustness  of  Validity,  Permutation  Test,  Conditional 
Inference,  Likelihood  Function 


1.  Introduction 


Most  of  the  research  in  the  area  of  robustness  has  focused  on  the 
problem  of  point  estimation  and  comparatively  little  attention  has  been 
paid  to  the  companion  problems  of  interval  estimation  and  testing.  The 
present  work  has  been  directed  at  one  of  the  simplest  realistic  testing 
problems:  The  construction  of  confidence  intervals  for  a  location 
parameter  when  samples  of  independent  observations  are  drawn  from  a 
symmetric  parent  distribution  of  unknown  location  and  scale.  The  goal  is 
to  achieve  validity  robustness  in  small  samples.  That  is,  we  seek 
procedures  for  which  the  empirical  coverage  probabilities  agree  with  the 
nominal  values  over  a  range  of  sampling  distributions. 

Our particular interest has centered on adapting some ideas of
Sprott and his coworkers on "small-sample asymptotics" in the classical
parametric  setting  (Sprott  and  Kalbfleisch  (1969),  Sprott  (1973),  Sprott 
(1980))  to  the  robustness  problem.  The  essential  idea  is  to  use 
the  shape  of  the  likelihood  function  as  an  indication  of  the  distribution 
of  the  maximum  likelihood  estimate.  However,  in  practical  robustness 
problems,  the  true  form  of  the  likelihood  function  is  unknown  and  our 
investigation  was  prompted  by  a  desire  to  see  whether  the  putative 
likelihood  function  (implicit  in  the  choice  of  the  estimation  technique) 
contained  useful  information  as  well. 

Our numerical results suggest that while Sprott's original idea
carries over quite well to the robustness setting, his particular
implementation does not, and other methods are required. One such method
is developed and its performance compared both with that of a standard
procedure (see Gross, 1976) and a novel permutation approach of Maritz
(1979). Moreover, some connections with the work of Efron and Hinkley
on conditional inference are explored (Efron and Hinkley (1978),
Hinkley (1978)).


2.  Review  of  the  Literature 

There  are  a  number  of  different  strands  that  can  be  identified  in 
the  fabric  of  research  in  this  area.  One  of  the  oldest  begins  with 
Gayen  (1949)  who  investigated  the  distribution  of  Student's  t-statistic 
when  the  data  are  not  normally  distributed.  His  corrections  are 
functions  of  the  skewness  and  kurtosis  of  the  parent.  In  practice,  these 
moments  must  be  estimated  from  the  data  and  for  small  samples  this  is 
quite  impractical.  More  recently,  Yuen  and  Murthy  (1974)  have  studied 
the  special  case  of  sampling  from  a  member  of  the  t-family  and  have 
derived  empirically  compact  approximations  to  the  distribution  of  the 
usual  t-statistic.  The  usefulness  of  this  work  has  been  diminished  by 
the  advances  made  by  Hampel  and  his  coworkers  (Hampel  (1973),  Field 
and Hampel (1982)) in the area of "small-sample asymptotics."

Another  approach  has  been  based  on  Huber's  M-estimator.  The  notion 
is  to  obtain  a  robust  estimate  of  location  and  a  corresponding  estimate 
of  its  variability  from  which  a  t-like  statistic  may  be  constructed. 
Intuitively, the notion is that such a statistic should display
reasonable robustness of validity. For example, Gross (1976) carried
out  extensive  empirical  investigations  of  the  behavior  of  some  25 
different  statistics  under  four  different  sampling  distributions.  The 
most  successful  ones  employed  either  the  jackknife  or  a  finite-sample 
version  of  the  asymptotic  standard  deviation  of  the  estimator  to  obtain  a 
denominator  for  the  test  statistic.  Those  statistics  based  on  location 
estimates  derived  from  Hampel's  redescending  influence  function  or 
Tukey's  bisquare  performed  quite  well,  both  in  terms  of  robustness  of 
validity  and  robustness  of  efficiency. 

An  interesting  proposal  was  put  forward  by  Maritz  (1979).  He 
pointed  out  that  classical  permutation  arguments  could  be  used  in 
obtaining  confidence  intervals  for  M-estimates  of  location.  A  difficulty 
arises when the scale parameter is unknown because some estimates of
scale destroy the validity of the permutation argument. However, other
common robust scale estimates such as the median absolute deviation are
permissible. Because Maritz's procedure conditions on the absolute
deviations of the observations from the center of the distribution, the
resulting confidence limits should be exact both conditionally and
unconditionally. The question of conditional confidence levels
arises in the work of Efron and Hinkley (1978) and Hinkley (1978) and
will be treated in Section 3 below. Unfortunately, Maritz did not carry
out any empirical studies.

Boos (1980) has proposed a procedure, motivated by a solution to
the problem of quantile estimation, which may be thought of as a simple
approximation to the more complex Maritz procedure. It seems also to
be related to a suggestion made by Bickel (1976) on carrying out robust
analyses by applying classical methods to appropriately transformed data.
The Boos and Maritz procedures are fairly demanding computationally
and the former suffers from the additional constraint that only
nondecreasing $\psi$-functions can be employed, thus ruling out Hampel's
redescending $\psi$-functions. Boos carried out a small empirical study and
concluded that his procedure held a small advantage in robustness of
validity and efficiency over the usual studentization method.

The  above  review  has  centered  on  methods  involving  the  construction 
of  a  test  statistic  from  a  particular  estimator.  A  more  ambitious  plan 
is  to  formulate  an  optimality  criterion  for  testing  in  the  robustness 
framework  and  to  develop  test  statistics  meeting  this  criterion.  The 
censored  likelihood  ratio  test  of  Huber  (1965)  is  an  early  example  of 
a  test  of  two  simple  hypotheses  with  specific  robustness  properties. 
Ylvisaker  (1977)  and  later  Lambert  (1981)  proposed  different  approaches. 
In  particular,  Lambert  defined  an  influence  function  for  a  test  in  terms 
of  the  behavior  of  its  P-values  when  the  data  are  sampled  from  a  model 
distribution  modified  by  point  contamination.  Lambert  examines  the 
influence  functions  of  a  number  of  common  test  procedures. 

Schrader  and  Hettmansperger  (1980)  introduced  likelihood  ratio-type 
tests,  based  on  robust  loss  functions,  for  the  general  linear  model. 

As  is  customary  in  the  robustness  literature,  these  authors  advocate 
joint  estimation  of  location  and  scale  parameters  but  fix  the  latter 
at  its  estimated  value  for  hypothesis  testing  for  the  former. 
Specifically, following work of Huber (1967), they show that the
asymptotic distribution of the difference in maximized robust loss
functions from a full model to a reduced model is proportional to
$\chi^2(m)$, where m is the corresponding reduction in dimensionality.
The only effect of mismatching the robust loss functions to the
distribution of the data is in the constant of proportionality. Although
these authors did not explicitly demonstrate how their proposal could be
used for interval estimation as well as hypothesis testing, the extension
is immediate.

Once  an  influence  function  has  been  defined,  considerations  similar 
to  those  proposed  by  Hampel  (1974)  in  the  estimation  case  can  be  brought 
to  bear  in  the  construction  of  new  test  statistics.  A  direct  extension 
of  Hampel's  influence  function  to  the  testing  arena  can  be  found  in 
Ronchetti  (1979)  and  Rousseeuw  and  Ronchetti  (1981).  Ronchetti  (1982) 
discusses  the  connection  between  this  influence  function  and  others  that 
have  been  proposed  and  suggests  that  an  appropriate  optimality  criterion 
for  a  test  of  a  simple  null  hypothesis  is  to  maximize  the  asymptotic 
power  (within  a  given  class  of  tests,  appropriately  standardized)  subject 
to  a  fixed  bound  on  the  influence  function  at  the  null  hypothesis. 
Ronchetti (1982) also discusses other notions such as the
change-of-variance function, which is germane to estimation, and a
change-of-power function, which is germane to testing. Optimal test
statistics are derived, though their small sample properties are not
investigated.

Interestingly, Rieder (1978) and Millar (1983) both propose a simple
test statistic similar to that of Boos and prove certain asymptotic
optimality properties.


To be considered as a meaningful statement in a frequentist sense,
the last sentence must be recast as follows: Suppose we identify an
infinite set of samples generating likelihoods $\{L_i(\theta)\}$ under the
model $F_\theta$ which differ trivially from each other but are all very
different from the corresponding functions of $\theta$:

$$L_N(\theta) = [2\pi I^{-1}(\hat\theta)]^{-1/2} \exp\{-\tfrac{1}{2}[(\hat\theta - \theta) I^{1/2}(\hat\theta)]^2\}.$$

(The quantity $\hat\theta$ changes with each sample.) Then the claim is
that the proportion of the resulting intervals that cover the true value
of $\theta$ will differ from the nominal level; i.e., the conditional level
of the procedure will be incorrect.

They illustrate this principle with an example based on sampling
from the exponential distribution with density $\theta^{-1}\exp(-t/\theta)$.
The likelihood function for $\theta$ in small samples tends to be asymmetric
and the inferences based on the asymptotics are misleading. A solution is
proposed, however.

Working for convenience in terms of the relative likelihood
function $R(\theta) = L(\theta)/L(\hat\theta)$, they suggest finding a
reparameterization of $\theta$ for which the relative likelihood function is
more nearly normal. The usual asymptotic theory should be applied on the
transformed scale and the resulting confidence interval then transformed
back to the original $\theta$ scale. In the exponential case, for example,
the transformation $\lambda = \theta^{-1/3}$ works well. The relative
likelihoods $R(\lambda) = L(\lambda)/L(\hat\lambda)$ tend to be quite normal
looking (very little asymmetry) and the actual confidence level matches
the nominal level. Thus the relevance of the usual asymptotics seems not
to be a function of the sample size per se but rather of the shape of the
(relative) likelihood functions generated.
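
As a rough numerical illustration of this point, the sketch below compares the log relative likelihood of an exponential sample with its quadratic (normal) approximation on the original $\theta$ scale and on a transformed scale. It assumes the cube-root transformation $\lambda = \theta^{-1/3}$, for which the third derivative of the log-likelihood vanishes at the maximum; the function names, the sample size n = 5 and the sample total are hypothetical choices made only for illustration.

```python
import math

def log_lik(theta, n, total):
    """Exponential log-likelihood for n observations with sum `total`."""
    return -n * math.log(theta) - total / theta

def log_rel_lik(theta, n, total):
    """log R(theta) = log L(theta) - log L(theta_hat), theta_hat = total / n."""
    return log_lik(theta, n, total) - log_lik(total / n, n, total)

def quad_theta(theta, n, total):
    """Quadratic approximation to log R on the theta scale."""
    theta_hat = total / n
    info = n / theta_hat ** 2              # observed information for theta
    return -0.5 * (theta - theta_hat) ** 2 * info

def quad_lambda(theta, n, total):
    """Quadratic approximation on the assumed scale lambda = theta**(-1/3)."""
    lam, lam_hat = theta ** (-1.0 / 3.0), (total / n) ** (-1.0 / 3.0)
    info = 9.0 * n / lam_hat ** 2          # observed information for lambda
    return -0.5 * (lam - lam_hat) ** 2 * info

def max_abs_error(approx, n, total, grid):
    """Largest gap between log R and its approximation over the grid."""
    return max(abs(log_rel_lik(t, n, total) - approx(t, n, total)) for t in grid)

grid = [0.4 + 0.05 * j for j in range(53)]  # theta from 0.4 to 3.0
err_theta = max_abs_error(quad_theta, 5, 5.0, grid)
err_lambda = max_abs_error(quad_lambda, 5, 5.0, grid)
print(err_theta, err_lambda)
```

For this illustrative sample the quadratic approximation is far closer on the transformed scale than on the original one, even though the sample size is the same in both cases.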

Sprott (1973) expands on this approach by carrying out a formal
Taylor series expansion of $\log R(\theta)$ about $\hat\theta$:

$$\log R(\theta) = -\tfrac{1}{2}(\theta - \hat\theta)^2 I(\hat\theta) + \tfrac{1}{6}(\theta - \hat\theta)^3 \frac{\partial^3}{\partial\theta^3}\log R(\hat\theta) + \cdots. \qquad (3.2)$$

The first term on the RHS of (3.2) corresponds to the relative likelihood
implicit in (3.1), which we denote by $R_N(\theta)$. If the second term on
the RHS of (3.2) is generally nonnegligible then the use of $R_N(\theta)$
alone as a basis for inference is suspect. On the other hand, if a
transformation $\lambda = \lambda(\theta)$ can be found for which the second
term is generally very small, or, ideally, identically zero, then the use
of the asymptotics is better justified.

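Rewriting (3.2) we have

$$\log R(\theta) = -\tfrac{1}{2}(\theta - \hat\theta)^2 I(\hat\theta)\Big\{1 - \tfrac{1}{3}(\theta - \hat\theta) I^{-1}(\hat\theta) \frac{\partial^3}{\partial\theta^3}\log R(\hat\theta)\Big\} + \cdots.$$

If attention is confined to the region $|\theta - \hat\theta| < k/I^{1/2}(\hat\theta)$,
then it seems sensible to define a measure of deviation from normality by

$$F_3(\theta) = I^{-3/2}(\hat\theta) \frac{\partial^3}{\partial\theta^3}\log R(\hat\theta). \qquad (3.3)$$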


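Under the transformation $\lambda = \lambda(\theta)$, Sprott showed that

$$F_3(\lambda) = F_3(\theta) + 3 I^{-1/2}(\hat\theta) \frac{d^2\lambda}{d\theta^2}\Big(\frac{d\lambda}{d\theta}\Big)^{-1}. \qquad (3.4)$$

If $\lambda$ can be found to make $F_3(\lambda)$ zero or more nearly so than
$F_3(\theta)$, then the normal approximation should work better on the
$\lambda$-scale.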

Sprott also employs some results of Welch and Peers (1963) to
explicate the connection between normal relative likelihoods and the
approximate normality of the maximum likelihood estimate. Briefly,
standard asymptotics implies that

$$v(\theta) = (\hat\theta - \theta) I^{1/2}(\hat\theta)$$

is approximately distributed as a standard normal deviate. Welch and
Peers showed that

Z(0,0)  -  v(0)  (v2(0)  +  2)  F3(0)  +  h(0 ,0 )  , 


where h is a complicated function of $\hat\theta$ and $\theta$, is more nearly
a standard normal. Sprott argues that any transformation
$\lambda = \lambda(\theta)$ that reduces $|F_3|$ will make the resulting
$v(\lambda)$ more nearly a linear function of $Z(\hat\lambda, \lambda)$ and
hence improve the accuracy of the normal approximation to the
distribution of $v(\lambda) = (\hat\lambda - \lambda) I^{1/2}(\hat\lambda)$.

Sprott (1980) extends these ideas to the case when nuisance
parameters are present. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample
of n independent observations from a distribution F in the family
$\{F_\theta\}$, indexed by a vector parameter $\theta = (\theta_1, \ldots, \theta_p)$.
The density of F is denoted by f. The problem is to estimate $\theta_1$ in
the presence of the nuisance parameters $\theta_2, \ldots, \theta_p$. The
relative maximum likelihood function of $\theta_1$ is defined to be

$$R_M(\theta_1; X) = f(X; \theta^*)/f(X; \hat\theta),$$

where $f(X; \theta) = \prod_i f(x_i; \theta)$ (the likelihood of the data as a
function of $\theta$), $\hat\theta$ is the MLE of $\theta$, and
$\theta^* = (\theta_1, \theta_2^*, \ldots, \theta_p^*)$ is the restricted MLE of
$\theta$ for a given value of $\theta_1$. In the following, notational
dependence on X will be suppressed.

Let $L = \log f(X; \theta)$ and define


One approach to inference about $\theta_1$ is to focus on the so-called
pivotal quantity

$$u(\theta_1) = (\hat\theta_1 - \theta_1)/[I^{11}(\hat\theta)]^{1/2}.$$

It can be shown that


</ 


3  log  R^(6  )  n  a  2 
- ."A—-  [I1  (8))  \ 

"l 


As in the single parameter problem, if a transformation
$\lambda = \lambda(\theta_1)$ can be found to reduce $F_3$ (and $F_4$ as well),
the resulting relative maximum likelihood function will tend to be more
normal in repeated samples and the normal approximation to the
distribution of the pivotal quantity u will be more defensible.

In one of his examples, Sprott (1980) considers the case of sampling
from the t-family of location and scale distributions. Because of the
population symmetry, $F_3(\hat\theta)$ tends to be quite small but
$F_4(\hat\theta)$ may not be negligible unless the sample size is fairly
large. To deal with this problem, Sprott suggests a simple device.
Instead of approximating $R_M$ by a normal curve, a t-curve should be used
to account for the fact that in small samples $F_4$ tends to be positive,
indicating less precision in the data.

Now the value of $F_4$ at the mode of the relative maximum likelihood
function of a t-distribution on M degrees of freedom is $6(M + 1)^{-1}$.
Thus the approximating t-curve to $\log R_M$ is found by solving the
equation $F_4(\hat\theta) = 6(M + 1)^{-1}$. Denote the solution by
$\hat M = \hat M(\hat\theta)$. Following the logic enunciated above, we would
then suppose that $u(\theta_1)$ is distributed approximately as $t_{\hat M}$
and set the appropriate confidence limits for $\theta_1$.
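
The matching step can be sketched numerically as follows. The code assumes that $F_4$ is the fourth derivative of the log relative likelihood at its mode, standardized by the squared curvature there (so that a t-shaped curve on M degrees of freedom gives $6(M + 1)^{-1}$), and it obtains derivatives by central differences; the function names and the step size h are hypothetical.

```python
import math

def t_log_rel_lik(u, m):
    """Log of a t-shaped relative likelihood on m degrees of freedom."""
    return -0.5 * (m + 1.0) * math.log(1.0 + u * u / m)

def f4_at_mode(log_r, mode=0.0, h=0.05):
    """Standardized fourth derivative of log_r at its mode (central differences)."""
    f = lambda d: log_r(mode + d)
    d2 = (f(h) - 2.0 * f(0.0) + f(-h)) / h ** 2
    d4 = (f(2 * h) - 4.0 * f(h) + 6.0 * f(0.0) - 4.0 * f(-h) + f(-2 * h)) / h ** 4
    # The information is -d2, so dividing by d2**2 standardizes d4.
    return d4 / d2 ** 2

def matching_df(f4_value):
    """Solve F4 = 6 / (M + 1) for M; only meaningful when F4 is positive."""
    if f4_value <= 0.0:
        raise ValueError("no approximating t-curve: F4 must be positive")
    return 6.0 / f4_value - 1.0

# Recover the degrees of freedom of a curve that is exactly t-shaped.
m_hat = matching_df(f4_at_mode(lambda u: t_log_rel_lik(u, 5.0)))
print(m_hat)
```

When the curve is flatter than any t-curve at the mode, the standardized fourth derivative is not positive and the matching equation has no solution, which the sketch reports as an error rather than guessing a value.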


Sprott's work is based on the conjecture that $R_M$ contains useful
information about the behavior of u and he simply provides a convenient
way of extracting the information in $R_M$ by a simple approximation using
a tabled distribution. A more computationally burdensome approach would
involve a more detailed look at $R_M$ along the lines suggested by Fraser
(1976).

Although the above discussion has been carried out in the classical
setting, it is perfectly feasible to apply it in the robustness setting.
In the language of M-estimation, suppose we choose a $\psi$-function
corresponding to a model distribution F. Data are obtained from an
unknown distribution, denoted G. A (pseudo) relative maximum likelihood
function is constructed based on the assumed model F and the observed
data from G. The shape of this function is approximated by an
appropriate t-curve and the latter is used to set confidence limits for
the parameter of interest. For example, we may choose F to be the
location-scale family for a t-distribution, which corresponds to choosing
a reasonable redescending $\psi$-function for estimation, while we may
sample from a member of the slash family (Rogers and Tukey, 1977).
Empirical studies must determine whether the pseudo-relative maximum
likelihood function does indeed carry useful information.

In some respects, Sprott's work is closely related to that of
Efron and Hinkley (1978) and Hinkley (1978) on conditional likelihood
inference. They argue that in general the observed information rather
than the expected information (Fisher information) is a better guide to
the variability of the maximum likelihood estimate, conditional on an
appropriate ancillary statistic. In the case of a single location
parameter, Efron and Hinkley, building on Fisher's work (1934) on
likelihood inference, show that the conditional distribution,
$f_\theta(\hat\theta \mid a)$, of the MLE $\hat\theta$ of $\theta$ given the
ancillary statistic a of order-statistic spacings is proportional to the
likelihood function of $\theta$. Thus, a normal-shaped likelihood does
indeed imply a conditional normal distribution of $\hat\theta$. Moreover,
the variance of this conditional normal distribution is
$i^{-1}(\hat\theta)$, the reciprocal of the observed information.

Hinkley extends this result to the location-scale case. The joint
conditional distribution of the MLEs given the appropriate ancillary is
again proportional to the likelihood function. The conditional
distribution of the pivot $(\hat\theta_1 - \theta_1)/\hat\theta_2$ is
asymptotically normal with mean 0 and variance $i^{11}$. In an example
based on sampling from the Cauchy distribution, Hinkley demonstrates the
superiority of the observed information to the usual Fisher information
as a measure of the variability of the pivot, though the final
recommendation is to set confidence limits through a direct approximation
of the likelihood function.

Thus  Sprott,  Efron  and  Hinkley  all  use  the  shape  of  the  observed 
likelihood  function  to  provide  some  indication  of  the  appropriate  measure 
of  variability  to  be  attached  to  a  point  estimate  of  location.  They 
focus  on  how  the  shape  of  the  likelihood  function  may  invalidate  the 
application  of  the  usual  unconditional  asymptotics.  Efron  and  Hinkley 
suggest  conditioning  as  a  remedy  while  Sprott  argues  that  transformations 
are  a  better  way  of  dealing  with  the  problem.  However,  when  an 
appropriate  transformation  cannot  be  found  or  applied,  Sprott  does 
consider  alternative  ways  of  approximating  the  likelihood  function. 




4.  The  Data 


In the simulations conducted for this study, data from the t-family
and slash family were generated. Standard pseudo-random number
generators of uniform and unit normal deviates were employed. Denoting
them by u and n respectively, let $s = n/u^{1/\nu}$. Then s is said to
follow the slash distribution with $\nu$ degrees of freedom provided n and
u have been generated independently. Variates from the t-family were
generated using the "ratio-of-uniforms" method described in Kinderman and
Monahan (1977).
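
A minimal sketch of the two generators follows. The slash variate assumes the form $s = n/u^{1/\nu}$. The t variate uses the basic ratio-of-uniforms rejection step on a bounding rectangle, not the tuned algorithm of Kinderman and Monahan, and is restricted to $\nu \ge 1$; the function names and seeds are hypothetical.

```python
import math
import random

def slash_variate(nu, rng):
    """Slash variate: unit normal divided by an independent uniform**(1/nu)."""
    n = rng.gauss(0.0, 1.0)
    u = 1.0 - rng.random()                 # uniform on (0, 1], avoids u == 0
    return n / u ** (1.0 / nu)

def t_variate(nu, rng):
    """Student t variate (nu >= 1) by plain ratio-of-uniforms rejection."""
    if nu < 1.0:
        raise ValueError("this sketch needs nu >= 1 for a bounded region")
    # Half-width of the rectangle: sup |x| * sqrt(f(x)) for the unnormalized
    # density f(x) = (1 + x^2 / nu) ** (-(nu + 1) / 2).
    if nu == 1.0:
        b = 1.0
    else:
        b = math.sqrt(2.0 * nu / (nu - 1.0)) * ((nu + 1.0) / (nu - 1.0)) ** (-(nu + 1.0) / 4.0)
    while True:
        u = 1.0 - rng.random()             # uniform on (0, 1]
        v = rng.uniform(-b, b)
        x = v / u
        # Accept when (u, v) lies in the region u <= sqrt(f(v / u)).
        if u <= (1.0 + x * x / nu) ** (-(nu + 1.0) / 4.0):
            return x

rng = random.Random(12345)
t_sample = [t_variate(5.0, rng) for _ in range(4000)]
slash_sample = [slash_variate(1.0, rng) for _ in range(4000)]
```

Passing an explicit `random.Random` instance keeps the uniform and normal streams reproducible, which is convenient when several interval procedures are to be compared on the same simulated samples.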


5.  The  Simulation 


The simulation study investigates the properties of four procedures:
(1) AST: a standard procedure based on the asymptotic studentization of
an M-estimate; (2) Tt: the procedure proposed here, based on an extension
of Sprott's work and employing a t-curve approximation to the observed
pseudo-likelihood based on matching fourth derivatives; (3) Tn: as (2)
above except that a normal approximation to the observed
pseudo-likelihood is used; (4) M: the Maritz procedure.

All four procedures generate confidence intervals based on the use
of the same $\psi$-function corresponding to the choice of model. The first
three require calculation of the appropriate (robust) estimates of
location and scale. The E-M algorithm (Dempster, Laird, and Rubin,
1977) was used. In order to implement Tt, all derivatives of the
logarithm of the pseudo-relative maximum likelihood function up to fourth
order must be computed so that the quantity $F_4(\hat\theta)$ can be obtained
easily (see Sprott, 1980, pp. 516-517). More detailed descriptions of the
procedures follow below.

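(1) AST - A 100(1 - $\alpha$) percent confidence interval is given by

$$(\hat\theta_1 - t_{n-1}(\alpha/2)\, S n^{-1/2},\ \hat\theta_1 + t_{n-1}(\alpha/2)\, S n^{-1/2}),$$

where

$$S = \Big[(n - 1)^{-1} \sum_i \psi^2\Big(\frac{x_i - \hat\theta_1}{\hat\theta_2}\Big)\Big]^{1/2} \hat\theta_2 \Big/ \Big[n^{-1} \sum_i \psi'\Big(\frac{x_i - \hat\theta_1}{\hat\theta_2}\Big)\Big]$$

and $t_\nu(\alpha)$ = 100(1 - $\alpha$) percent point of Student's t on $\nu$
degrees of freedom.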

Remark: An intuitive interpretation of AST is obtained by considering
the pivotal $(\hat\theta_1 - \theta_1)/\sqrt{v(\hat\theta)}$, where

$$v(\hat\theta) = \hat\theta_2^2 \Big/ \sum_i \psi'\Big(\frac{x_i - \hat\theta_1}{\hat\theta_2}\Big).$$

When the putative likelihood function defined by $\psi$ matches the
underlying distribution, this pivotal is approximately distributed as a
standard normal variate. When these functions do not match, $v(\hat\theta)$
misestimates the variance of $\hat\theta_1$ and the pivotal must be rescaled
for approximate standard normality to be obtained. The correction used
in AST is given by

$$c_\psi(x) = \sum_i \psi'\Big(\frac{x_i - \hat\theta_1}{\hat\theta_2}\Big) \Big/ \sum_i \psi^2\Big(\frac{x_i - \hat\theta_1}{\hat\theta_2}\Big),$$

so that $v(\hat\theta)/c_\psi(x)$ is the appropriate variance estimate of
$\hat\theta_1$. This interpretation will prove useful when we contrast the
methods below.
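
The AST computation can be sketched as below, taking the location and scale estimates as given. The sketch assumes that S is the root mean square of $\psi$ (with divisor n - 1) times the scale estimate, divided by the average of $\psi'$, and that the Student t percent point is supplied by the caller from tables; the $\psi$-function of a t model on 3 degrees of freedom and all function names are illustrative assumptions rather than the choices used in the study.

```python
import math

def t_model_psi(z, nu=3.0):
    """psi-function of a t model on nu d.f. (nu = 3 is an illustrative choice)."""
    return (nu + 1.0) * z / (nu + z * z)

def t_model_dpsi(z, nu=3.0):
    """Derivative of t_model_psi with respect to z."""
    return (nu + 1.0) * (nu - z * z) / (nu + z * z) ** 2

def ast_interval(x, theta1, theta2, psi, dpsi, t_quantile):
    """Interval theta1 -/+ t_quantile * S / sqrt(n) for given estimates."""
    n = len(x)
    z = [(xi - theta1) / theta2 for xi in x]
    sum_psi2 = sum(psi(zi) ** 2 for zi in z)
    sum_dpsi = sum(dpsi(zi) for zi in z)
    s = math.sqrt(sum_psi2 / (n - 1)) * theta2 / (sum_dpsi / n)
    half = t_quantile * s / math.sqrt(n)
    return theta1 - half, theta1 + half

def remark_quantities(x, theta1, theta2, psi, dpsi):
    """v = theta2^2 / sum psi' and c_psi = sum psi' / sum psi^2 from the remark."""
    z = [(xi - theta1) / theta2 for xi in x]
    sum_psi2 = sum(psi(zi) ** 2 for zi in z)
    sum_dpsi = sum(dpsi(zi) for zi in z)
    return theta2 ** 2 / sum_dpsi, sum_dpsi / sum_psi2

# With psi(z) = z the interval reduces to the classical Student t interval.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
lo, hi = ast_interval(data, 3.0, 1.0, lambda z: z, lambda z: 1.0, 2.776)
print(lo, hi)
```

Under these assumptions $S^2/n$ equals $v/c_\psi$ up to the factor n/(n - 1), which is the sense in which AST rescales the pivotal.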

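(2) Tt - A 100(1 - $\alpha$) percent confidence interval is given by

$$(\hat\theta_1 - t_{\hat M}(\alpha/2)[i^{11}(\hat\theta)]^{1/2},\ \hat\theta_1 + t_{\hat M}(\alpha/2)[i^{11}(\hat\theta)]^{1/2}),$$

where $\hat M$ is the solution to $F_4(\hat\theta) = 6(M + 1)^{-1}$.

(3) Tn - A 100(1 - $\alpha$) percent confidence interval is given by

$$(\hat\theta_1 - z(\alpha/2)[i^{11}(\hat\theta)/c]^{1/2},\ \hat\theta_1 + z(\alpha/2)[i^{11}(\hat\theta)/c]^{1/2}),$$

where

$z(\alpha)$ = 100(1 - $\alpha$) percent point of the standard normal
distribution,

$c = c(x)$ = a moderating factor for the usual asymptotic variance
that depends on the data observed.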

Remark: Both Tt and Tn stem from the notion that the usual conditional
asymptotic statement, namely that
$(\hat\theta_1 - \theta_1)/[i^{11}(\hat\theta)]^{1/2}$ is distributed as a
standard normal variate, is not valid in small samples. Tt uses instead
a t-approximation to the observed pseudo-likelihood while Tn adjusts the
asymptotic standard error $[i^{11}(\hat\theta)]^{1/2}$, again by recourse to
the observed pseudo-likelihood. In other words, Tn behaves as if the
likelihood is indeed normal but with a spread that may differ
substantially from that suggested by the asymptotics.

In this simulation we have used a fairly crude method to determine
the adjustment factor c(x). The relative maximum likelihood function
$R_M(\cdot)$ is evaluated at three pairs of points symmetrically placed
about $\hat\theta_1$. For k = 1, 2 and 3, let

$$W_k = \tfrac{1}{2}\big[R_M(\hat\theta_1 + k(i^{11}(\hat\theta))^{1/2}) + R_M(\hat\theta_1 - k(i^{11}(\hat\theta))^{1/2})\big].$$

Then set $c_k = -2k^{-2}\log W_k$ and $c(x) = 3^{-1}(c_1 + c_2 + c_3)$. Thus
c(x) represents a compromise among three estimates of the required scaling
factor, obtained by looking at different points on the shoulders of
$R_M(\cdot)$. If $R_M$ is exactly normal with variance $i^{11}(\hat\theta)$
then c(x) = 1.
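
The adjustment factor can be sketched in a few lines. The code assumes that $W_k$ is the average of the relative maximum likelihood function at $\hat\theta_1 \pm k[i^{11}(\hat\theta)]^{1/2}$, that $c_k = -2k^{-2}\log W_k$, and that c(x) is the mean of the three $c_k$; the function names and the numerical values in the check are hypothetical.

```python
import math

def moderating_factor(rel_lik, theta1_hat, i11):
    """Average of three curvature-type estimates c_k taken at k = 1, 2, 3.

    rel_lik    : callable returning the relative likelihood, equal to 1 at theta1_hat
    theta1_hat : location estimate
    i11        : observed-information variance estimate for theta1_hat
    """
    se = math.sqrt(i11)
    c_values = []
    for k in (1, 2, 3):
        w_k = 0.5 * (rel_lik(theta1_hat + k * se) + rel_lik(theta1_hat - k * se))
        c_values.append(-2.0 * math.log(w_k) / k ** 2)
    return sum(c_values) / 3.0

def normal_shaped(theta1_hat, variance):
    """A relative likelihood that is exactly normal with the given variance."""
    return lambda t: math.exp(-0.5 * (t - theta1_hat) ** 2 / variance)

# A normal-shaped curve with variance i11 / 0.6 should give c(x) = 0.6.
c_check = moderating_factor(normal_shaped(1.5, 0.04 / 0.6), 1.5, 0.04)
print(c_check)
```

For a curve that is exactly normal all three $c_k$ coincide, so averaging only matters when the shoulders of the observed curve depart from the normal shape.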

(4) M - A 100(1 - $\alpha$) percent confidence interval is given by
$(\theta_L, \theta_U)$, where $\theta_L$ and $\theta_U$ are solutions to the
equations

$$\sum_i \mathrm{sgn}(x_i - t)\,\psi(|x_i - t|/s) = \pm z(\alpha/2)\Big[\sum_i \psi^2(|x_i - t|/s)\Big]^{1/2},$$

where $s = \mathrm{med}\{|x_i - t|\}$.

Remark: The above prescription actually represents a convenient normal
approximation to the full permutation distribution derived by Maritz
(1979). He noted that the usual permutation argument applied to means or
nonparametric statistics like the Wilcoxon signed rank statistic could
also be applied to M-estimates. If we define

$$M_\psi(x, t) = \sum_i \mathrm{sgn}(x_i - t)\,\psi(|x_i - t|),$$

then the M-estimate of location for the data x is the solution of the
equation $M_\psi(x, t) = 0$. To obtain a $(1 - 2r/2^n)$ two-sided confidence
interval $(t_1, t_2)$, the values $t_1$ and $t_2$ must be determined by
finding the r-th smallest and the r-th largest values of t solving the
equations

$$\sum_i \mathrm{sgn}(x'_i - t)\,\psi(|x'_i - t|) = 0,$$

where the summation is over i = 1, 2, ..., s and s = 1, 2, ..., n. That is,
we consider all possible solutions to the basic equation when the data
are allowed to vary over all subsets of the original data. The desired
values $t_1$ and $t_2$ are the r-th smallest and r-th largest of these
solutions.

In practice this calculation is somewhat demanding so that a normal
approximation is recommended. Secondly, an estimate of scale is often
needed and one that is a function of the absolute deviations
$|x_1 - t|, |x_2 - t|, \ldots, |x_n - t|$ may be employed without disturbing
the permutation argument.

Remark: Boos' (1980) procedure is essentially equivalent to Maritz's
except  that  s  is  replaced  by  some  fixed  estimate  of  scale  (not  depending 
on  t)  and  a  t  distribution  on  (n  -  1)  degrees  of  freedom,  rather  than  the 
normal  distribution,  is  employed  as  the  reference. 
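
The normal approximation to the Maritz procedure can be sketched as follows. The code assumes the interval consists of those t for which the studentized sum of signed $\psi$ values is at most z in absolute value, with the scale s recomputed as the median absolute deviation from each candidate t. It locates the limits on a grid rather than by exact root finding, and uses Huber's $\psi$ with tuning constant 1.5 purely as an illustrative choice; all names are hypothetical.

```python
import math
import statistics

def huber_psi(z, k=1.5):
    """Huber psi-function (an illustrative monotone choice)."""
    return max(-k, min(k, z))

def maritz_statistic(x, t, psi):
    """Sum of sgn(x_i - t) psi(|x_i - t| / s) divided by its permutation s.d."""
    s = statistics.median([abs(xi - t) for xi in x])   # scale depends on t
    terms = [math.copysign(psi(abs(xi - t) / s), xi - t) for xi in x]
    return sum(terms) / math.sqrt(sum(v * v for v in terms))

def maritz_interval(x, z, psi=huber_psi, steps=2000):
    """Smallest and largest grid values of t with |statistic| <= z."""
    lo, hi = min(x), max(x)
    span = hi - lo
    start = lo - span
    grid = [start + 3.0 * span * j / steps for j in range(steps + 1)]
    accepted = [t for t in grid if abs(maritz_statistic(x, t, psi)) <= z]
    return min(accepted), max(accepted)

sample = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
lower, upper = maritz_interval(sample, 1.96)
print(lower, upper)
```

A grid search is crude but avoids assuming that the statistic is monotone in t, which need not hold once s varies with t or a redescending $\psi$-function is used.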




6.  Results 


The  results  of  the  major  simulations  are  presented  in  Table  1. 

While  Tt  performs  adequately  when  the  data  are  generated  from  the 
t-family,  it  breaks  down  quite  badly  when  slash-family  data  are  used.  On 
the  other  hand,  Tn  performs  quite  well  throughout  though  it  is  somewhat 
inferior  to  AST  for  slash  data  (but  superior  to  AST  for  t ^  data). 

The  Maritz  procedure  performs  about  as  well  overall  as  AST  and  somewhat 
better  in  fact  for  t ^  data. 


Insert  Table  1  about  here 


It seems clear that the device of approximating $R_M$ by a t-curve
based on matching fourth derivatives at the mode is too sensitive to the
shape of the underlying parent distribution. Figure 1 displays a typical
$R_M$ and its approximations by t-curves and a normal curve. The t-curve
approximation is quite poor while the normal curve one is excellent.

Similar experiments were run for samples of size 10 and 40 but as the
comparisons are qualitatively the same as for size 20, they are not
presented here.


Insert  Figure  1  about  here 


The conditional coverage probabilities of the procedures are also
of some interest, especially in light of the Efron-Hinkley proposals.
Table 2 presents results for four combinations of data and model which
are illustrative of the results obtained for the full set of
combinations depicted in Table 1. For each combination, the coverage
probabilities for the three procedures are based on the same set of 1000
samples which have been divided into thirds based on the value of
$i^{11}(\hat\theta)$. The cut points are provided. There is an obvious
pattern of increasing coverage probability with increasing
$i^{11}(\hat\theta)$. Of course, ideally there should be no trend with
$i^{11}(\hat\theta)$. Particularly in the case of the Tn procedure, it
appears as if the low observed values of $i^{11}(\hat\theta)$ are "too"
low while high ones are "too" high.
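
The tabulation of conditional coverage can be sketched as follows, assuming each simulated sample has been reduced to a pair holding its value of $i^{11}(\hat\theta)$ and an indicator of whether the interval covered the true location. The function name and the tiny data set are hypothetical, and at least three samples are needed.

```python
def coverage_by_thirds(records):
    """Split (i11, covered) pairs into thirds by i11 and report coverage.

    Returns the three coverage proportions (lowest third first) and the
    two cut points, taken as the smallest i11 in the middle and upper thirds.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    bounds = [0, n // 3, (2 * n) // 3, n]
    coverage = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        group = ordered[a:b]
        coverage.append(sum(1 for _, covered in group if covered) / len(group))
    cut_points = (ordered[bounds[1]][0], ordered[bounds[2]][0])
    return coverage, cut_points

# Toy input in which coverage rises with i11.
toy = [(0.1, False), (0.2, True), (0.3, True),
       (0.4, True), (0.5, True), (0.6, True)]
toy_coverage, toy_cuts = coverage_by_thirds(toy)
print(toy_coverage, toy_cuts)
```

Applying the same split to each procedure on a common set of samples makes the proportions directly comparable across procedures.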

Employing jackknifed values of $i^{11}$ in the construction of Tn
confidence intervals immediately suggests itself as a possible remedy.
However, a small experiment based on jackknifing $\log i^{11}$ and then
transforming back did not give promising results. While the Maritz
procedure performs better than the others, as one would expect from its
theoretical properties, its conditional coverage probabilities do follow
the same trend as those of the others.

Insert  Table  2  about  here 

Recall now that Hinkley (1978) found that $i^{11}(\hat\theta)$ seriously
underestimates the variance of $\hat\theta_1$ when sampling from the Cauchy
distribution (n = 20). Our findings support this. In Table 3, we display
the median values of c(x) for each combination of model and data, each
again based on a run of 1000 samples. The quantities in parentheses are
the ratios of the interquartile range to the median for each batch of
1000 values of c(x). That c(x) does not vary much across samples within a
given sampling situation explains why it does not also vary across the
sampling situations investigated here. In any case, the values of c(x)
are substantially less than unity, indicating that the use of
$i^{11}(\hat\theta)$ alone would produce intervals much too liberal.

Insert  Table  3  about  here 


7.  Concluding Remarks

Our results agree with those of Sprott in the sense that the applicability
of asymptotic normality depends more on the shape of the observed
likelihood function than on sample size. Our results generalize his since
our experiments demonstrate that pseudo-likelihoods, corresponding to
robust M-estimates, also share this property.

Specifically, our results indicate that the small sample distribution
of the pivotal

    $(\hat\theta_1 - \theta_1) / \sqrt{i^{11}(\hat\theta)}$        (7.1)

is approximately normal with mean zero, but with nonunit variance. Thus,
the observed information is not a good variance estimate for
$\hat\theta_1$ in small samples with unknown scale. This holds true
whether or not the estimating function matches the distribution of the
data.


The  observed  (pseudo-)  likelihood  function  can  be  used  to  rescale 
the  pivotal  to  obtain  an  honest  small  sample  variance  estimate.  We  chose 
to  do  this  directly  by  approximating  the  rescaling  factor  by  comparing  the 
observed  relative  maximized  likelihood  function  to  a  family  of  normal 
likelihoods.  This  proved  quite  successful  albeit  somewhat  ad  hoc. 
Alternative  direct  approximation  methods  might  prove  slightly  better. 

We have not been successful in determining an algebraic expression
for the rescaling factor to be applied to the pivotal (7.1). Sprott's
suggestion of matching the 4th derivative of an approximating t-family did
not perform well. Our current conjecture is that the correct variance
estimate of $\hat\theta_1$, when $\theta_2$ is unknown, is the observed
value of $[i^{-1}(\hat\theta) A(\hat\theta) i^{-1}(\hat\theta)]_{11}$,
where $A(\hat\theta) = \sum_i s_i(\hat\theta) s_i(\hat\theta)^T$ and
$s_i(\hat\theta)$ is the contribution to the score vector (location and
scale) of the ith observation. This follows directly from Huber's (1967)
general results concerning the asymptotic distribution of robust
M-estimates. Our conjecture is that this expression is valid both
conditionally and unconditionally, and that it should be used routinely to
set confidence intervals for location parameters, even in situations where
the estimator matches the distribution of the data! This variance estimate
tends to correct for both small sample sizes and the mismatch of
estimating function and data distribution.
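The conjectured variance estimate can be illustrated numerically. The sketch below assumes a Student's t putative likelihood with known degrees of freedom `nu`, fits location and scale jointly by an EM-type reweighting, obtains the observed information by central differences of the total score, and returns both the (1,1) element of the inverse information and the (1,1) element of the sandwich. All function names are hypothetical, and the fitting and differencing choices are assumptions made for illustration rather than the procedure used in the study.

```python
import numpy as np

def t_scores(x, mu, sigma, nu):
    """Per-observation score (d/dmu, d/dsigma) of the t(nu) log density."""
    r = (x - mu) / sigma
    w = (nu + 1.0) / (nu + r**2)
    return np.column_stack((w * r / sigma, (w * r**2 - 1.0) / sigma))

def fit_t(x, nu, iters=2000, tol=1e-12):
    """Joint location/scale fit of a t(nu) likelihood by EM-type reweighting."""
    mu, sigma = np.median(x), np.std(x)
    for _ in range(iters):
        r = (x - mu) / sigma
        w = (nu + 1.0) / (nu + r**2)
        mu_new = np.sum(w * x) / np.sum(w)
        sigma_new = np.sqrt(np.mean(w * (x - mu_new) ** 2))
        done = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma

def location_variances(x, nu, eps=1e-5):
    """Return (inverse-information, sandwich) variance estimates for location."""
    x = np.asarray(x, dtype=float)
    theta = np.array(fit_t(x, nu))

    def total_score(th):
        return t_scores(x, th[0], th[1], nu).sum(axis=0)

    # observed information: minus the Jacobian of the total score,
    # approximated here by central differences (an illustrative shortcut)
    info = np.empty((2, 2))
    for j in range(2):
        h = np.zeros(2)
        h[j] = eps
        info[:, j] = -(total_score(theta + h) - total_score(theta - h)) / (2 * eps)
    info = 0.5 * (info + info.T)

    s = t_scores(x, theta[0], theta[1], nu)
    a = s.T @ s                      # A = sum_i s_i s_i^T
    inv = np.linalg.inv(info)
    return float(inv[0, 0]), float((inv @ a @ inv)[0, 0])
```

For a sample `x`, `location_variances(x, nu=5.0)` returns the two candidate variance estimates side by side, so their ratio can be compared across samples.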

Our  research  has  identified  some  open  problems  in  the  areas  of 
conditional inference and robust testing. We were aware of possible
connections among these areas prior to the research reported here, and we
are now more strongly convinced that the results of Hinkley (1978) can be
extended to situations where the putative likelihood function does not
match the distribution of the data. Theoretical results along these
lines would be very useful in practice, since they would not only provide
computationally cheap alternatives to the Maritz procedure, but would also
be easily extended to the general linear model.

A question mark also remains as to why the AST method performs so
well: in particular, why joint estimation of location and scale
parameters, followed by testing methods assuming known scale (fixed at
its estimated value), provides such good coverage probabilities. We argue
that this is a general phenomenon not well understood in the statistical
community. In particular, in certain applications of Bayesian methods,
fixing nuisance parameters at their estimated values rather than providing
priors for them often yields extremely accurate inferences. Clearly much
needs to be done to understand the mechanism behind these empirical
findings.
Table 2

Conditional 95 Percent Coverage Probabilities

Data    Model    $i^{11}(\hat\theta)$    AST    $T_n$    M

Cut values of


Table 3

Median Values of Pivotal Rescaling Factor, c(x)
(ratio of interquartile range to median in parentheses)

n = 20

   C2        C5        C10       S2        S5        S10

   .57       .54       .53       .55       .53       .52
  (.09)     (.08)     (.08)     (.095)    (.095)    (.09)

   .71       .70       .69       .71       .69       .69
  (.04)     (.025)    (.02)     (.04)     (.02)     (.02)

   .79       .78       .78       .78       .78       .78
  (.02)     (.006)    (.004)    (.01)     (.005)    (.004)

n = 10

   C2        C5        C10       S2        S5        S10

   .54       .52       .51       .53       .51       .51
  (.11)     (.10)     (.09)     (.13)     (.10)     (.10)

   .66       .66       .65       .66       .66       .65
  (.035)    (.02)     (.02)     (.04)     (.02)     (.02)

   .73       .73       .73       .73       .73       .73
  (.004)    (.001)    (.0017)   (.003)    (.002)    (.0017)

[Figure: Relative likelihood and two approximations for a sample from
Slash (2df); estimates based on a Student's t (2df)]